Learning compositional functions via multiplicative weight updates
Compositionality is a basic structural feature of both biological and artificial neural networks. Learning compositional functions via gradient descent incurs well-known problems such as vanishing and exploding gradients, making careful learning rate tuning essential for real-world applications. This paper proves that multiplicative weight updates satisfy a descent lemma tailored to compositional functions. Based on this lemma, we derive Madam---a multiplicative version of the Adam optimiser---and show that it can train state-of-the-art neural network architectures without learning rate tuning. We further show that Madam is easily adapted to train natively compressed neural networks by representing their weights in a logarithmic number system. We conclude by drawing connections between multiplicative weight updates and recent findings about synapses in biology.
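To make the two ideas in the abstract concrete, the sketch below shows (a) a generic multiplicative weight update in the spirit of Madam, where each weight is scaled by an exponential factor so that step sizes are proportional to the weight's own magnitude, and (b) a toy base-2 logarithmic number system quantiser. Both are hypothetical illustrations under simplified assumptions, not the paper's exact algorithms: the real Madam includes details such as update clipping and weight clamping, and the paper's number format may differ.

```python
import math

def madam_like_step(w, grad, v, eta=0.01, beta=0.999, eps=1e-8):
    """One multiplicative update step for a list of scalar weights.

    Hypothetical sketch in the spirit of Madam; the paper's algorithm
    differs in details (e.g. update clipping, weight clamping).
    """
    new_w, new_v = [], []
    for wi, gi, vi in zip(w, grad, v):
        # Adam-style running estimate of the squared gradient.
        vi = beta * vi + (1 - beta) * gi * gi
        g_norm = gi / (math.sqrt(vi) + eps)
        # Multiply the weight by exp(-eta * sign(w) * normalised gradient):
        # the step is proportional to the weight's own magnitude, which is
        # what makes the update "multiplicative" and sign-preserving.
        sign = 1.0 if wi >= 0 else -1.0
        new_w.append(wi * math.exp(-eta * sign * g_norm))
        new_v.append(vi)
    return new_w, new_v

def log_quantise(w, frac_levels=4):
    """Snap each weight to a toy base-2 logarithmic number system: a sign
    plus an exponent rounded to 1/frac_levels resolution. Illustrative
    only; the paper's exact number format is not specified here.
    """
    out = []
    for wi in w:
        if wi == 0.0:
            out.append(0.0)
            continue
        sign = 1.0 if wi > 0 else -1.0
        # Round log2|w| to the nearest representable exponent.
        e = round(math.log2(abs(wi)) * frac_levels) / frac_levels
        out.append(sign * 2.0 ** e)
    return out
```

Because a multiplicative update only rescales each weight, it never changes a weight's sign, and the resulting weights remain well matched to a logarithmic representation, which is why the two ideas combine naturally.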
Review for NeurIPS paper: Learning compositional functions via multiplicative weight updates
Weaknesses: I was not totally convinced by the experiments section, and have questions about that section and some more general questions which the authors might address:
1. The way that Figure 1 is laid out suggests that it is appropriate to compare the three algorithms over the same set of values of eta. Can the authors justify this? It seems to me that the meaning of eta in the Madam algorithm is different to its meaning in SGD and Adam (it is effectively a coincidence that these different hyperparameters share a name). What happens if you evaluate Madam over a denser grid of eta values and then zoom in the x-axis of the left-hand plot?
2. The reported value for the transformer on the wikitext-2 task, for SGD and Madam, seems very high. Perhaps the authors are using a different unit of measurement?
Review for NeurIPS paper: Learning compositional functions via multiplicative weight updates
This is a good paper which combines insights from optimization, hardware, and neuroscience to give a multiplicative weight update for neural nets. It seems worthwhile to try out multiplicative updates in the context of modern architectures, and this paper seems to have made them competitive with existing optimizers, in a way that allows lower-precision computation (as low as 8 bits). As far as I can tell, there isn't a clear advantage for current hardware, but this serves as a good proof-of-concept that could help inform future hardware design. While no single insight is particularly deep, everything is combined in an interesting and cohesive way, so the reviewers and I think this paper is definitely above the bar for acceptance. I encourage the authors to account for the reviewers' feedback in the camera-ready version.